GRADE:

82/100. Be more mindful of the justifications for your answers!.

Instructions:

Section 1: Using basic commands to analyze sequence data

Today and next week we will start analysing some sequence data in the command line.

Nothing too fancy, just some basic commands to analyze data and summarize information from the sequences.

Downloading the datasets

Today’s dataset are going to be found in two main files.

These are the addresses to download them:

https://raw.githubusercontent.com/Tabima/MBB101/master/Lab_3/AAB21188.seq
https://raw.githubusercontent.com/Tabima/MBB101/master/Lab_3/S79522.seq
  1. Create a folder called Lab_3 in your MBB101 folder at the cluster
  2. Download the files. Answer the following question:

What commands you used to download the files?

Add you answers here: 
wget https://raw.githubusercontent.com/Tabima/MBB101/master/Lab_3/AAB21188.seq
wget https://raw.githubusercontent.com/Tabima/MBB101/master/Lab_3/S79522.seq

  1. Check the files. Answer the following questions:

What kind of sequence (DNA/RNA/Protein) is the S79522.seq file? Briefly justify your answer.

Add you answer here: List of base pairs of DNA

How are you justifiyng the answer? (-2 pts)

What kind of of sequence (DNA/RNA/Protein) is the AAB21188.seq file? Briefly justify your answer.

Add you answer here: A list of amino acids starting with at Met

How are you justifiyng the answer? (-2 pts)


Counting basic patterns for DNA data

Ok, after checking the files we know now that the S79522.seq file contains a DNA sequence.

Lets do some command line exercises with it. Answer all of these questions:


How many lines are found on this file?

Add you answer here: 1 Lines

How many characters are found on this file? (Or, what is the length of this sequence?)

Add you answer here: 541 characters

How many A, C, G, and T can you find in this sequence?

Add you answer here: A = 171
                     C = 104
                     G = 131
                     T = 134

Convert this sequence to RNA and save it in a file called S79522_RNA.seq. Please add the command you used to create the file. (Don’t create a complementary strand, just convert this one from DNA to RNA)

Add you answer here: sed 's/T/U/g' S79522.seq > S79522_RNA.seq

How can we test that this new file is an RNA sequence S79522_RNA.seq. Please add a command you’d use to check if this is DNA or RNA.

Add you answer here: cat the file and check for uricil or grep the file looking for U and if it returns True it should be RNA

Finally, using your RNA sequence, check how many start codons (AUG) or stop codons (UAG,UAG, or UGA) may exist in your sequence

Add your answer here: All of the codons have one each 1 AUG, 1 UAG, 1 UGA, 1 UAA

CAm, this is wrong. The answer is 11 start codon and there are 3 UAG, 10 UAA, and 14 UGA (-5 points)

Counting basic patterns for Aminoacid data

Ok, after checking the files we know now that the S79522.seq file contains a DNA sequence.

Lets do some command line exercises with it. Answer all of these questions:

How many characters are found on this file? (Or, what is the length of this sequence?)

Add you answer here: 541 Characters

Select two aminoacids of interest and count them. Please add the command used to count them.

Add you answer here: 4 Total Arginines, and 2 Histinines

Cam, where is the code? (-2 pts)

We have four candidate patterns for this gene:

  • VLKYYKVD (A common pattern found in the Ribosomal S27 subunit domains)
  • ITLEVE (A common pattern found in Ubiquitin domains)
  • KWCPRP (A common pattern found in In Between Ring fingers (IBR) domains)
  • LSVLSQKM (A common pattern found in DNA polymerase unit A domains)

Which patterns do you find in the sequence?

Sequence Ribosomal 27 Ubiquitin IBR DNA poly A
AAB21188 Yes Yes No No

Section 2: Learning pipes to create pipelines

In this section we will learn something new and exciting:

We will make a simple pipeline to automate counting nucleotide sequences.

Creating a pipeline

Pipelines are sequential lines of code that can be used to use multiple commands at once.

For example, if you want to count all the U in a DNA sequence that you have transformed into DNA, you can actually do this in one simple line of code.

Think about this as a logical set of steps (or, a pseudocode):

  1. Read the DNA file
  2. Replace the T for U
  3. Extract the matches to U
  4. Count those matches

Using the command line the intructions would be like:

cat file.txt
sed 's/T/U/g' file.txt > file_rna.txt
grep -o U file_rna.txt > U_ocurrences.txt
wc -l U_ocurrences.txt

This code works, but its quite useless as it forces us to create two independent files: file_rna.txt and U_ocurrences.txt.

Imagine that the output of these files were large and you had no space left in your hard drive. It would ruin the program

One of the options is to use pipelines, using the pipe (|) modifier. This modifier works as an an then.

So for example, a command that is like this: ls -lrt | grep seq | wc -l can be read as:

Make a list of all the files and then extract all the matches to seq and then count the number of occurences.

So, we can totally use this with out data.

Pipeline example using DNA data

Before we start, let me introduce you to two elements:

  1. Variables
  2. the echo command.

Variables

Variables can be defined as a place to temporarly store a piece of information.

For example, you wallet is a variable that is full of your money.

So, if you have $10 in your wallet, then your variable is:

wallet=10

So, that is how you define a variable in bash. To see what the value of that variable is, you use the echo command

The echo command

The echo command will print to your screen whatever you tell it to print.

For example, if you tell the command to echo 'hello', this will happen:

echo 'hello'
## hello

Meaning that you can tell it to say whatever you want to say. If you want to add spaces or other weird characters, make sure you surrond the statement with single quotes

echo 'This is the best course ever'
## This is the best course ever

Mixing variables and echo

So, we want to mix these two because then you can store a variable and print it in the screen, for example:

statement='This is the best course ever'
echo $statement
## This is the best course ever

Look at this! You can simplify the instructions and print variables!

Now, did you notice that the variable statement had a $ in front? That is how you tell bash that you are using a variable!

If you don’t add it this is what happens:

echo statement
## statement

So, you are just printing the word statement. Interesting ah?

Using a pipeline as an example of variables and pipes

Now, lets say we want to count all the nucelotides in our DNA file independently.

So, we want to extract each nucleotides, starting from the A, using grep -c A, and then counting lines with wc -l:

grep -o A Lab_3/S79522.seq | wc -l
## grep: Lab_3/S79522.seq: No such file or directory
##        0

Answer the following question:


Did these results match what you did at the beginning of the lab? Expand briefly your answer.

Add you answer here: Yes we got 171 for 'A' and thats what we got in the beginning aswell


Ok, now, to save this on a variable (you can call the variable anything as long as it doesn’t have any spaces or weird characters) we just put the entire pipeline inside a $(), like this:

nucA=$(grep -o A Lab_3/S79522.seq | wc -l)
echo $nucA
## grep: Lab_3/S79522.seq: No such file or directory
## 0

WOW! that was easy!

Now, if you want to add more variables you can do it like this:

nucA=$(grep -o A Lab_3/S79522.seq | wc -l)
nucC=$(grep -o C Lab_3/S79522.seq | wc -l)

echo 'Number of A:'$nucA
echo 'Number of C:'$nucC
## grep: Lab_3/S79522.seq: No such file or directory
## grep: Lab_3/S79522.seq: No such file or directory
## Number of A: 0
## Number of C: 0

Now, answer the following question:

How would you count the other nucleotides?. Add the code below:

#Add you answer here: nuCT=$(grep -o T S79522.seq | wc -l)
                     # nuCU=$(grep -o U S79522.seq | wc -l)


Homework 1:

Write a set of instructions to count ten of the aminoacids in the AAB21188.seq file.

#Add you answer here: grep -oi 'E' AAB21188.seq | wc -l
# grep -oi 'K' AAB21188.seq | wc -l
# grep -oi 'D' AAB21188.seq | wc -l
# grep -oi 'R' AAB21188.seq | wc -l
# grep -oi 'I' AAB21188.seq | wc -l
# grep -oi 'S' AAB21188.seq | wc -l
# grep -oi 'A' AAB21188.seq | wc -l
# grep -oi 'N' AAB21188.seq | wc -l
# grep -oi 'P' AAB21188.seq | wc -l
# grep -oi 'C' AAB21188.seq | wc -l

Section 3: Using Cyberduck to upload local files into the cluster

MAKE SURE YOU ARE CONNECTED TO THE VPN!

So, we know how to download files from different sources into the cluster. Today, we are going to learn how to move files from and to the cluster using a program called cyberduck.

  1. Download cyberduck from this page. Download the version for your operating system and install it.

  2. Open Cyberduck. To connect to the cluster follow these instructions:

  1. You have landed in your cluster folder at /home/username!

Now you can navigate the cluster using your mouse like if you were in your computer.

  1. Download the sequence_1 and sequence_2 files from the moodle.

  2. Create a new folder called mystery_genes in your Lab_3 folder using cyberduck (Right click –> New Folder)

  3. Select the sequence_1 and sequence_2 files and drag them to the mystery_genes folder.

  4. Connect to the cluster using terminal/mobaxterm, go to the Lab_3/mystery_genes folder and check if the files are there. Answer the following question:


How did you check if the files were correctly uploaded?

#Add you answer here: I used ls -l in the mystery_genes folder


Checking file types

If you remember, we saw some file types for sequence data in the theory class. These are FASTA and GenBank.

We are going to deal with these files today, and see how we can extract important information from these files. First, answer the following questions:


Briefly summarize what is a FASTA and what is a GenBank file (one sentence each)

#Add you answer here: Fasta files have a header and the gene data following it. GenBank files have multiple headers and information on authors and publications then the gene file.

Can these formats include RNA, DNA and Aminoacid data? (Yes/No answer)

#Add you answer here: Yes they can include all of these


Now, use your knowledge of the command line to answer these questions:


What format is the sequence_1 file in? (FASTA or GenBank) Briefly explain your answer.

#Add you answer here: It is fasta format because it has a small header starting with a greater than sign, and then it has its sequence.

What format is the sequence_2 file in? (FASTA or GenBank) Briefly explain your answer.

#Add you answer here: It is a genbank file because it has a header with a lot more information on authors an publications along with the sequence.


If you answered correclty, you have seen that sequence_1 is in FASTA format, while sequence_2 is in GenBank format.

FASTA: Data handling

Answer these questions (and please ask the professor for help)


How many lines does the sequence_1 file has?

#Add you answer here: sequence 1 has 10 lines

How can you count the number of SEQUENCES that exist in the sequence_1 file?

#Add you answer here: It has 5 sequences total, I grepped the '>' and word counted it grep '>' sequence_1 | wc -l

What is the number of sequences in the sequence_1 file? Is this result different than the number of lines in the file? Why?

#Add you answer here: It has 5 sequences total, and it is different than the number of lines. Each sequence can have more than one line


Now, we know that the number of sequences (five) can be recognized by counting the number of > symbols in the sequence, right?

This means that for this particular FASTA file, you can extract each sequence individually if you know the lines.

For example: - lines 1 and 2 have the first sequence - lines 2 and 3 the second

and so on.

That means we can extract each sequence. You can do this using the sed command in the following way

sed -n 'staring_line,ending_line p' file_with_sequences

So, for example, to extract the first sequence in FASTA format:

sed -n '1,2 p' sequence_1

Now, because this is extracting in a fasta file, you will get the header and the sequence data. To only see the sequence data, use grep to extract all lines that DON’T have a > symbol. You can pipe the results of sed into grep!

sed -n '1,2 p' sequence_1 | grep -v ">" 

in this case, we had to put the > symbol between quotes because without them then the command line creates a new file.

Finally, we can even count the length of the sequence:

sed -n '1,2 p' sequence_1 | grep -v ">" | wc -c 

This first sequence is 7478 nucleotides long! Ain’t this cool?

Answer the following questions:


How can you save the new sequence in a new FASTA file?

#Add you answer here: we can use sed -n '1,2 p' sequence_1 > pizza.fasta

Extract all other sequences from the file and save them in 5 different FASTA files. Add the code you used below.

#Add you answer here: sed -n '1,2 p' sequence_1 > sequence_1_1
                      I repeated this for all ten lines for the five                         sequences

Homework 2:


Question 1: Can you create code that will do the following for each extracted FASTA sequence? 1. Count the number of A,G,C and T? 2. Count the length of each sequence? 3. Count the number of open reading frames (ORF)?

Present the code and the results

#Add you answer here: 

1. sed '2,2 p' sequence_1_1 | grep -v '>' | grep 'A' | wc -l 
        (for each file and letter)
2. sed '2,2 p' sequence_1_1 | grep -v '>' | wc -l 
        (for each file and letter)
3. '2,2 p' sequence_1_1 | grep -v '>' | grep -o 'ATG' | wc -l 
        (for each file and letter)

Section 4: GenBank Files

GenBank files are more complex than FASTA files.

These files are supremely complex to analyse only using the command line. However, we can take a look a this data and answer a couple of simple questions using the command line:


How many lines does the sequence_2 file has?

#Add you answer here: 310 lines (wc -l sequence_2)

Open the file using less -NS and answer the following questions: 1. If you were looking for the information on the gene that codes for lacZ, what line would you use? 2. What species is this gene taken from? 3. What genomic region does this represent? 4. How many other genes can you find?

#Add you answer here:

1. LacZ starts at line 130
2. It is taken from the E. Coli Bacteria
3. Gebomic DNA from E. Coli Bacteria
4. Contains LacA, LacI, LacY, and lacZ

Homework 3: Genic regions on GenBank files are of three different kinds: mRNA, gene, CDS.

Each of these represent a different genetic sequence.


What are the differences between these the kinds?

#Add you answer here:

- mRNA:Tells us what the mRNA of the specific gene does
- gene:Tells us what the gene name is
- CDS: The CDS contains the actual sequence of the gene

Not bad, but a little bit incomplete.

mRNA includes regions additional to just the coding sequence. In this case we see that a Repressor is included!.

Gene represents the area that spans the gene.

(-7 points)